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Abstract. In dynamic Monte Carlo simulations, using for example the Metropolis dynamic, it is 
often required to simulate for long times and to simulate large systems. We present an overview of 
advanced algorithms to simulate for longer times and to simulate larger systems. The longer-time 
algorithm focused on is the Monte Carlo with Absorbing Markov Chains (MCAMC) algorithm. It is 
applied to metastability of an Ising model on a small-world network. Simulations of larger systems 
often require the use of non-trivial parallelization. Non-trivial parallelization of dynamic Monte 
Carlo is shown to allow scalable algorithms, and the theoretical efficiency of such algorithms are 
described. 



INTRODUCTION 

Dynamic Monte Carlo is used when dynamic information about a particular system is 
required. For example, for spin- 1/2 lattice systems, starting from a quantum Hamiltonian 
coupled to a heat bath, the underlying dynamic for the Ising model can be derived as: 1) 
randomly and uniformly choose one spin; 2) decide whether or not to flip the spin based 
on a spin-flip probability p. The functional form for p may for instance be Metropolis 
111], Glauber (derivable from coupling the quantum system to a fermionic heat bath [2|]), 
or a form obtained from coupling the quantum system to a bosonic heat bath |3]. Since 
the simulated dynamic is defined by the underlying physical system, it should not be 
altered. While remaining faithful to the dynamic, algorithms that allow for long-time 
simulations and non-trivial parallelization are still possible. Some of these algorithms 
will be presented (for a review see [4]). 

In this article we review the use of the Monte Carlo with Absorbing Markov Chains 
(MCAMC) method and apply the method to an Ising ferromagnet on a small-world 
network. We also describe the use of ideas from non-equilibrium surface science to 
study the theoretical scalability of non-trivial parallelization applied to parallel discrete 
event simulations (PDES), such as the dynamic Monte Carlo method. 



TABLE 1. The spin arrangements for the first 7 of 12 spin classes used in the 
MCAMC calculations. The energies associated with these spin configurations enter 
the spin flip probabilities, /?,-. 
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FASTER DYNAMIC METROPOLIS SIMULATIONS 

In dynamic Monte Carlo simulations, the dynamic is given by the underlying physical 
system, so it cannot be changed. Consequently, many of the well-known algorithms, 
such as loop algorithms, cluster algorithms, and multicanonical algorithms cannot be 
used since they are not faithful to the dynamic. Furthermore, one Monte Carlo step 
per spin (MCSS) corresponds to an underlying microscopic time [3], which often is 
much shorter than the time scale needed for the simulation. For example, in simulating 
ferromagnets a Monte Carlo step is approximately an inverse phonon frequency 
about 10 seconds. The lifetime of a metastable state desired for device time scales 
is years for magnetic recording. In modeling paleomagnetism, the time scales of the 
metastable state are millions of years. To simulate over such disparate time scales 
requires faster-than-real-time algorithms. 

Whenever the rejection rate is high, event-driven rejection-free methods are useful. 
These include the n-fold way [5] and its generalization to the MCAMC method [6]. A 
rejection-free algorithm for continuous spin systems has recently been published |7]. 
An alternative algorithm for first-passage times is the projective dynamics method 
These algorithms can often accelerate simulations by many orders of magnitude. 

Here we apply the MCAMC method to study metastability of the Ising model on a 
small- world network. The Hamiltonian is 
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Here a,- = ±1, 7i is the ferromagnetic interaction along the chain, J2 is a ferromagnetic 
interaction for the small-world connections (see below), and H is the applied external 
field. We use periodic boundary conditions for the Ising spins. Each Ising spin has 
one small-world connection. It is obtained by starting with the first spin, and randomly 
connecting it to any of the other A'^ — 1 spins. If the next spin is not yet connected with a 
small- world connection, one of the remaining unconnected spins is randomly connected 
to it. These connections are quenched, and do not change in a particular simulation. 
Many quenched random small-world bond configurations are needed to determine the 
effect of the randomness. 




FIGURE 1. (a) The Binder fourth-order cumulant for the order parameter for the Ising ferromagnet on 
a small-world network with H = 0. The crossings for various system sizes gives an estimate of the critical 
temperature, (b) The average lifetime in MCSS for// = — 0.27i and H = — 0.47i. Note the large lifetimes. 



The applied Monte Carlo dynamic is: 1) one of the A'^ spins is chosen at random, 2) 
a uniform deviate r on (0, 1] is chosen, and 3) the chosen spin is flipped if the Glauber 
flip probability [9] satisfies r < exp(— ^new/T") / [exp(— ^new/T") +exp(— ^oid/r)]. Here 
Boltzman's constant has been set to unity, ^oid is the energy of the current spin config- 
uration, and finew is the energy of the spin configuration with the chosen spin flipped. 
We start with all spins C7 = +1, apply a field H < 0, and measure the lifetime T until 
the magnetization is first equal to zero. Using the same quenched random small- world 
bonds, we average over many such escapes to obtain the average lifetime (t) measured 
in MCSS. 

To measure a metastable lifetime, one needs to be below the critical temperature, Tc. 
We estimate Tc using the Binder fourth-order cumulant of the order parameter [9]. Simi- 
lar equilibrium studies of small- world Ising ferromagnets have recently been performed 
The crossings of this cumulant provide a straightforward way of estimating Tc 
(Fig. 1(a)). The average lifetime for T <Tc grows exponentially in T^ (Fig. 1(b)). This 
necessitates the use of faster-than-real-time algorithms. 

One way of accelerating the computations is to use a rejection-free algorithm. This 
includes the n-fold way algorithm Bfl] in continuous time, but it also has a counterpart in 
discrete time 0,0]. When all spins are -|-1, then the probability of flipping a single spin 
in one step is pi and the average time required before a spin flips is pj^^. Therefore, for 
small Pi, computations can be accelerated by asking how long it takes to change from 
the state of all spins up to the state with one overturned spin. This is an example of the 
5=1 MCAMC algorithm (5=1 transient state, the current state). Whenever the spins 
are all +1, the time increment m = [ ln(ri)/ln(l — pi)\ + 1 is added, and a randomly 
chosen spin is flipped. Here ri is a uniformly distributed random number on (0, 1], [-J is 
the integer part, and all spins are equivalent, so we can randomly pick one to flip (in the 
language of the n-fold way algorithm, all spins are in the same spin class). 

At low temperatures and small fields the s = I MCAMC algorithm still does not give 
the best performance. However, the performance can be improved by adding additional 



states to the transient subspace. For example, for = 2 in this model, the transient part 
of the absorbing Markov chain is 



T^|^l--^-WA^) P^l^y (2) 

Here x is defined below. Then, whenever all spins are +1, the time increment m that is 
added to T, corresponding to exiting to a state with two overturned spins, is given using 
a uniform random deviate ri in (0, 1] by the solution of 

vjT"'e<r^<vjT"'-^e (3) 

where = ( 1 1 ) and the initial vector is vj^ = ( 1 ) . 

Once the time increment m to exit the transient subspace is obtained, the next spin 
configuration must be found, i.e. a configuration with two overturned spins. Let N2 be the 
number of small- world bonds that connect nearest-neighbor (nn) spins. Let x = x\+X2 
with 

N-Nj N-Ni 
^1 = -^[^P2 + PA + {N-A)p,] = ^^y, (4) 

■^2 = J^[P5+P2 + {N-^)pi\=J^y2. (5) 

Then the new spin configuration is chosen, using uniformly distributed random numbers 
r,- for / = 2, ■ ■ ■ ,6. 

If r2X > x\, one of the spin pairs with small-world bonds longer than nn is randomly 
chosen using r3, one of these two spins is chosen with and is flipped. If r^y\ < 2p2, 
using rg one of the two nn spins along the chain is chosen and flipped. If 2p2 < r^yi < 
2p2 + {N — 4)pi, the spin connected to the flipped spin by the small-world bond is 
flipped. If neither of the two conditions above involving is satisfied, then rs is used to 
choose one of the other A'^ — 4 spins (except the flipped spin or the 3 spins it is connected 
to), and the chosen spin is flipped. 

If r2X <xi a similar procedure is used for spins belonging to the N2 doubly-connected 
bonds. 

The MCAMC algorithms do not change the dynamics, but rather only implements 
the dynamics in a fashion that enables simulations to longer lifetimes. Results for the 
average lifetime obtained from 10^ escapes for one realization of the quenched small- 
world bonds are shown in Fig. 1(b). 



IS THE METROPOLIS DYNAMIC PARALLELIZABLE? 

Dynamic Monte Carlo and event-driven rejection-free Monte Carlo methods belong to a 
class of problems called discrete-event simulations (DES). Non-trivial parallelization of 
dynamic Monte Carlo and n-fold way algorithms has been accomplished for Ising spin 
systems Jslllll]. Using ideas and methodologies of non-equilibrium surface science, it 
has recently been shown [ll2i1 that conservative PDES implementations should have a 



virtual time horizon in the Kardar-Parisi-Zhang (KPZ) universality class lll3h . Provided 
that this is the case, then all short-ranged asynchronous parallel DES simulations can 
be made to be perfectly scalable. This is because, as the number of processing elements 
(PEs) goes to infinity, the utilization stays finite ifl^l . and the measurement portion of 
the algorithm can be bounded flT]. A brief review is presented here. 

The stochastic nature of the Metropolis dynamic makes it difficult to utilize a parallel 
computing environment to the fullest extent because a priori there is no global clock to 
synchronize physical processes in a system with asynchronous dynamics. However, the 
system is not inherently serial. 

The methodology for PDES simulations works in all dimensions, but for simplicity we 
consider parallelization of dynamic Monte Carlo for a one-dimensional Ising model. In 
non-trivial parallelization, the spin system is spatially distributed among L processing 
elements, i.e., physical processes and interactions between physical subsystems are 
mapped to logical processes and logical dependences between PEs (Fig.EJ. In our model 
of PDES performance for the spin system with nn interactions, we consider an ideal 
system of L identical PEs, arranged on a ring, where communications between PEs take 
place instantaneously. Each PE manages the state of the assigned subsystem of A'^ spins, 
and has its own time (called the local virtual time, LVT). The LVT progresses on each 
PE during the simulation. The asynchronous nature of physical dynamics implies an 
asynchronous system of logical processes. Logical processes execute concurrently and 
exchange time-stamped messages to perform state updates of the entire physical system 
being simulated. A sufficient condition for preserving causality in simulations requires 
that each logical process works out the received messages from other logical processes 
in non-decreasing time-stamp order 1 15 . [l6ll . PDES are classified in two categories: 
optimistic llisll and conservative 1 17, T§, 19]. In conservative PDES, an algorithm does 
not allow a logical process to advance its LVT (i.e., to proceed with computations) until it 
is certain that no causality violation can occur. In optimistic PDES, an algorithm allows 
a logical process to advance its LVT regardless of the possibility of a causality error. 
The optimistic scenario detects causality errors and provides a recovery procedure to 
detect and fix such errors. Several aspects of a PDES algorithm should be considered in 
efficiency studies, including: the synchronization procedures; the average utilization (a) 
of the parallel environment as measured by the mean fraction of working PEs between 
update attempts; the memory requirements per PE; and the scalability as measured by 
evaluating the performance when L is increased. 

In our study the main concept is the virtual time horizon (VTH), defined as the set 
of the LVTs for all logical processes. We model the growth of the VTH as a deposition 
process of Poisson random time increments on a one-dimensional lattice of L processors. 
The growth rule of the VTH is defined by the PDES algorithm. The width of the VTH 
provides a measure of the desynchronization in the system of PEs and is related to the 
memory requirements for parallel simulations I21I1 . Here the principle is: the 

larger the width, the larger the memory required per PE. The asymptotic scalability of 
an algorithm can be assessed by applying coarse-grained methods to the VTH [1^. 
Computational speed-up (as measured by comparing the performance of the parallel 
with sequential simulations) can be derived from the microscopic structure of the VTH 

In]. 

In modeling a conservative PDES, at each update attempt t, on each PE the simulation 



algorithm randomly selects one of the A'^ spin sites. If the selected site is an interior site, 
the update happens and the simulated LVT is incremented for the next update attempt: 
T(f + 1 ) = t(?) + 77 , where 7] is a random time increment that is sampled from the Poisson 
distribution with unit mean. If the selected site is a border site, the PE must wait until the 
LVT of its neighbor(s) is not less than its own LVT, at which time the waiting PE makes 
the update and proceeds. For N = I the LVT of both neighboring PEs are considered, 
while for N > I only the corresponding neighboring PL's LVT is considered. 

In the most unfavorable case of conservative parallelization N = I. For such a closed 
spin chain the mean utilization {u{L;N = 1)) of the parallel processing environment is 
simply the mean density of local minima in the conservative VTH during the steady 
state. Analyzing the microscopic structure of the VTH at saturation, it is possible to 
derive approximate analytical formulas for {u{L;N)) and the higher moments of u{L;N) 
(Fig. 3(a)). For example, 

(m(L;1)) = (L+1)/4L, L>3 
(m(L;2)) = (3L+1)/8L, L>3 ' 

Note that as L — > 00, the utilization is about 1/4 for A'^ = 1 and about 3/8 for N — 2. For 
large A'^, the asymptotic utilization can be near the theoretical limit of unity. 

The conservative PDES utilization depends on A'^, as well as on the number A^^ of 
effective border lattice sites per PE (here A'^t, = 2), and on the communication topology. 
Our earlier large-scale simulations [21] show that the worst-case (A^ = 1) conservative 
scenario for a spin chain can be greatly improved when A'^ is increased while retaining the 
ring communication topology with Nt, = 2 (Fig.|3tb)). Thus, to take the best advantage 
of conservative parallelization one should use many PEs with many spins per PE [see 
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FIGURE 2. The mapping of short-ranged physical processes to logical processes. The nn physical 
interactions (two-sided arrows in the left part) on a lattice with periodic boundary conditions are mapped 
to the ring communication topology of logical processes (two-sided arrows in the right part). Each PE 
carries lattice sites, but communications take place only for border sites. 
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FIGURE 3. (a) The steady-state mean utilization vs the system size in conservative PDES for a spin 
chain with N = 1. Analytical result (solid curve); infinite-L limit (dashed line); and simulation data 
(symbols), (b) The time evolution of the mean utilization in conservative PDES for spin chains when 
each PE carries spin sites, two of which are the effective border sites. Observe that the utiUzation grows 
with A^. 



Fig-Etb)]. In this case, preliminary analysis of the width of the VTH shows it scales as: 

where a = 1/2 and the cross-over times are: tix ~ N independent of L; ?2x ~ NU. 
Here z = a//32 is the dynamic exponent, and the growth exponents are /3i ~ 1/2 (cor- 
responding to random deposition) and /32 ~ 1/3 (corresponding to the KPZ universality 
class) [ fl3l] . The scaling exponent a = 1/2 of the VTH width at saturation (for t ^ t2x) 
implies that the memory requirement for the state savings grows as a power law, i.e., 
as \/LN. Recent applications of the conservative algorithm to modeling magnetization 
switchin g [.11.1 and a dynamic phase transition in highly anisotropic thin-film ferromag- 
nets |23,|24|] indicate that conservative parallelization can be very efficient in simulating 
spin dynamics with short-range interactions. 



DISCUSSION AND CONCLUSIONS 

This brief paper has described how to make dynamic Monte Carlo simulations faster and 
larger. The algorithms described do not change the dynamics in any fashion, but rather 
implement the dynamics on the computers using advanced techniques. 

To accelerate the simulations, faster-than-real-time algorithms may be implemented. 
These include the n-fold way algorithm [|5|] and its extension, the Monte Carlo with 
Absorbing Markov Chain (MCAMC) algorithm as well as the projective dynamics 



method fs*]. We outlined s = 1 and s = 2 MCAMC methods, as applied to magnetic field- 
reversal in a ferromagnetic Ising model on a small-world network. 

To make the simulations larger, non-trivial parallelization is required. We briefly de- 
scribed how ideas from non-equilibrium surface science can be used to understand such 
simulations. In particular, all short-ranged conservative PDES should have a virtual time 
horizon governed by the KPZ universality class. In this case, all short-ranged PDES 
(such as dynamic Monte Carlo) can be made scalable u sing a conservative PDES ap- 
proach. Conservative PDES references include l4l ll7LlllL ll^ . The alternative imple- 
mentation, optimistic PDES simulations 1 15, 16] for dynamic Monte Carlo simulations, 
have shown some of the difficulties with scalability [25]. 
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